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Preface 



First of all, we want to congratulate two new research communities from Mex- 
ico and Brazil that have recently joined the Iberoamerican community and the 
International Association for Pattern Recognition. We believe that the series 
of congresses that started as the “Taller Iberoamericano de Reconocimiento de 
Patrones (TIARP)”, and later became the “Iberoamerican Congress on Pattern 
Recognition (CIARP)”, has contributed to these group consolidation efforts. We 
hope that in the near future all the Iberoamerican countries will have their 
own groups and associations to promote our areas of interest; and that these 
congresses will serve as the forum for scientific research exchange, sharing of ex- 
pertise and new knowledge, and establishing contacts that improve cooperation 
between research groups in pattern recognition and related areas. 

CIARP 2004 (9th Iberoamerican Congress on Pattern Recognition) was the 
ninth in a series of pioneering congresses on pattern recognition in the Iberoamer- 
ican community. As in the previous year, CIARP 2004 also included worldwide 
participation. It took place in Puebla, Mexico. The aim of the congress was 
to promote and disseminate ongoing research and mathematical methods for 
pattern recognition, image analysis, and applications in such diverse areas as 
computer vision, robotics, industry, health, entertainment, space exploration, 
telecommunications, data mining, document analysis, and natural language pro- 
cessing and recognition, to name a few. 

CIARP 2004 was organized by the Computer Science Department of the Na- 
tional Institute of Astrophysics, Optics and Electronics (INAOE) , the Center for 
Computing Research of the National Polytechnic Institute (CIC-IPN) and the 
University of Las Americas, Puebla (UDLAP), and was sponsored by the Insti- 
tute of Cybernetics, Mathematics and Physics of Cuba (ICIMAF), the Center of 
Applications of Advanced Technology of Cuba (CENATAV), the University of 
La Salle, Mexico (ULSA), the Autonomous University of Puebla (BUAP), the 
International Association for Pattern Recognition (I APR), the Cuban Associa- 
tion for Pattern Recognition (ACRP), the Portuguese Association for Pattern 
Recognition (APRP), the Spanish Association for Pattern Recognition and Im- 
age Analysis (AERFAI), the Special Interest Group on Pattern Recognition of 
the Brazilian Computer Society (SIGPR.-SBC), and the Mexican Association for 
Computer Vision, Neurocomputing and Robotics (MACVNR). 

We received contributions from 18 countries. In total 158 papers were sub- 
mitted, out of which 87 were accepted for publication in these proceedings and 
for presentation at the conference. The review process was carried out by the 
Scientific Committee, each paper being assessed by at least two reviewers who, 
in conjunction with other reviewers prepared an excellent selection dealing with 
ongoing research. We are especially indebted to them for their efforts and the 
quality of the reviews. 
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Three professors were invited to give keynote addresses on topics in pattern 
recognition: Dr. Josef Kittler, Professor at the School of Electronics and Phys- 
ical Sciences, University of Surrey, UK, Dr. Alberto Del Bimbo, University of 
Florence, Italy, and Dr. Eduardo Bayro Corrochano, Computer Science Depart- 
ment, Center of Research and Advanced Studies, Guadalajara, Mexico. 

We would like to thank the members of the organizing committee for their 
enormous efforts that allowed for an excellent conference and proceedings. 
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of Sports Videos 
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{i .kolonias ,w . Christmas , j . kittler}@surrey . ac .uk 
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Abstract. The interpretation by a human of a scene in video material 
is heavily influenced by the context of the scene. As a result, researchers 
have recently made more use of context in the automation of scene un- 
derstanding. In the case of a sports video, useful additional context is 
provided by formal sets of rules of the sport, which can be directly ap- 
plied to the understanding task. Most work to date has used the context 
at a single level. However we claim that, by using a multilevel contex- 
tual model, erroneous decisions made at a lower can be avoided by the 
influence of the higher levels. In this work, we explore the use of a mul- 
tilevel contextual model in understanding tennis videos. We use Hidden 
Markov models as a framework to incorporate the results of the scene 
analysis into the contextual model. Preliminary results have shown that 
the proposed system can successfully recover from errors at the lower 
levels. 



1 Introduction 

Constructing cognitive vision systems which can extract knowledge from visual 
data so as to derive understanding of the scene content and its dynamics, as 
well as to facilitate reasoning about the scene, has recently moved to the top 
of the research agenda of the computer vision research community. One of the 
many potential applications of such systems is to provide automatic annotation 
of videos so that the user can retrieve the desired visual content using iconic 
or linguistic queries in a user-friendly way. This would help to facilitate access 
to the huge quantities of video information currently stored in archives (and 
growing daily) and make its retrieval more efficient. 

However, it is generally accepted that automatic analysis and interpretation 
of video is a challenging problem in computer vision for a number of reasons. First 
of all, there is no direct access to 3D information, as video material is invariably 
captured using a monocular camera. Second, as the camera is not under our 
control, we are not able to invoke the active vision paradigm which might help 
us to interpret a scene by directing the camera to observe each scene object from 
different views. Third, very little is typically known about the camera system 
that captured the data. 



A. Sanfeliu et al. (Eds.): CIARP 2004, LNCS 3287, pp. 1-12, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 




2 



Ilias Kolonias, William Christmas, and Josef Kittler 



On the positive side, a video captures contextual information which can be 
made to play a crucial role in the cognitive process. The merit of contextual 
information in sensory data perception has been demonstrated in application 
domains such as optical character recognition (OCR) and speech processing. For 
instance, in the former case, sequences of interpreted characters must constitute 
valid words. Thus, neighbours (preceding and ensuing characters) convey crucial 
contextual information for the correct recognition of each character. Similarly, at 
the next level, language grammar provides context aiding the correct interpre- 
tation of words by restricting the admissible sequences of word types and form. 
At the language understanding level, the circumstances relevant to the subject 
of discourse furnish contextual clues enabling a text processing system to gain 
understanding of the content. 

In contrast, the use of context in visual data processing is less prevalent. 
Historically, the focus has been on dealing with the 3D nature of objects under 
varying illumination which makes object recognition in computer vision a much 
harder problem than OCR or speech perception. Moreover, in classical computer 
vision scenarios, such as a robot assembly cell or an office, both spatial and 
temporal context are relatively limited. For instance, key, books, cups can be 
placed almost anywhere and the scenes are largely static. Consequently, there are 
hardly any scene evolution rules that would restrict scene events in a productive 
way to assist interpretation. The lack of motivation for contextual processing 
has limited the development of tools that could be used to model and exploit 
contextual information in a routine manner. 

However, in a number of application domains addressed more recently the 
contextual information is much richer and can play an important role in scene 
interpretation. A typical example is sports videos where the scene evolution is 
governed by strict rules. In this paper we explore the role of context in scene 
interpretation, with a focus on tennis videos. We demonstrate that in this event- 
driven sport, the detection of visual events and highlights, as well as the symbolic 
interpretation of the state of play, can be aided by contextual decision making. 
We construct a multilevel contextual model of tennis game evolution based on 
the hidden Markov model approach. We then use this model to annotate a 
tennis video. We show that the model has the capacity to derive the correct 
interpretation of the scene dynamics, even in the presence of low-level visual 
event detection errors. 

The paper is organised as follows. In Section 2 we review related work. The 
contextual model adopted is developed in Section 3. The resulting interpretation 
scheme is described in Section 4. The method is evaluated experimentally in 
Section 5. Conclusions are drawn in Section 6. 



2 Related Work 

In order to provide a high-level analysis of video material, there has to be some 
means of bridging the gap between the contextual information generated by 
the visual processing modules and the set of assumptions that humans make 
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when viewing the same material. In general, these assumptions are often fuzzy 
and ill-defined, which makes the bridging task difficult. However, as we have 
mentioned, in the case of sports material, we are helped by the existence of a set 
of well-defined rules that govern the progress of a sporting event; these rules can 
potentially provide strong additional context. For these reasons, Hidden Markov 
Models (HMMs) and other variants of Dynamic Bayesian Networks (DBNs) 
are often employed to bridge the semantic gap in sports video analysis. There 
have been several successful attempts of such use of HMMs in relation to the 
analysis and annotation of sporting events. Here we discuss some of this work, in 
particular relating to tennis [1,2], snooker [3], Formula 1 racing [4] and baseball 
[5]. The hand gesture recognition system of [6] is also relevant. 

— In [4] the authors are developing a complete cognitive audio/ visual system 
for the analysis of videos of Formula 1 race events. They extract semantic 
information from both the audio and the visual content of the sequence, and 
annotate the material by detecting events perceived as highly important. For 
example they include visual events such as overtaking, and cars running off 
the road. In addition, as the sequences used came from off-air video, they 
also extracted and used textual information about the race, including the 
drivers’ classification and times. Since off-air material was used, the audio 
part of the sequences was dominated by the race commentary; from that, 
features such as voice intensity and pause rates were used. Having extracted 
all of this information, the authors attempted to infer events of semantic im- 
portance through the use of Dynamic Bayesian Networks, inferring content 
semantics either by using audio and video information separately or com- 
bining this information in the temporal domain. Both approaches yielded 
promising results when tested on simple queries (for example finding shots 
in the Formula 1 race where a car runs out of the race track) . 

— In [1] Petkovic et al. use HMMs in order to classify different types of strokes 
in tennis games, using the body positions of the players. The hit types de- 
tected include fore-hands, back-hands, serves, etc. To do this, the authors 
initially segmented the players out of the background; then they used Fourier 
descriptors to describe the players’ body positioning; finally they trained a 
set of HMMs to recognise each type of hit. The results of this work show 
that this method can be quite successful in performing the recognition task 
it was designed for. 

— Kijak et al. [2] also use an HMM to classify a tennis game into these scenes: 
first missed serve, rally, replay, and commercial break. In this work they 
used the HMM to fuse both visual cues, including dominant colours, spatial 
coherency and camera motion activity, and audio cues, including speech, 
applause, ball hits, noise and music. A second level of HMM is also used in 
this method, to reflect the syntax of a tennis game. 

— Video footage of snooker is used in the work of [3]. The white ball is de- 
tected and tracked, and its motion analysed to identify the strategy of the 
player. The table is partitioned into five strategic areas, which are used by 
a set of four HMMs to classify the play into five categories: shot-to-nothing, 
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in which a shot pots a single coloured ball, but with no further strategic 
benefit; building a break by keeping the white ball in the centre of the ta- 
ble; conservative play (similar to the shot-to-nothing, but with no coloured 
ball potted); escaping from a snooker; and a miss, in which no collision is 
detected. The maximum likelihood measures from each HMM are compared 
to generate the classification. 

— The work of Chang et al. [5] describes a method to extract highlights from 
baseball games. They use a fusion of static and dynamic information to clas- 
sify the highlights into home run, good hit, good catch and witlrin-diamond 
play. The static information is represented as statistical modules in the form 
of histograms, that can be classified into seven different types of play. The 
dynamic information is in the form of a set of HMMs, one for each highlight 
type. Each HMM models the transitions between the types of play that are 
representative of the corresponding highlight. 

— In [6], the authors use HMMs to analyse hand gestures. Their framework de- 
composes each complex event into its constituent elementary actions. Thus 
in one of the examples the authors have used, a gesture is broken down into 
simple hand trajectories, which can be tracked more successfully via HMMs. 
Then, they apply Stochastic Context-Free Grammars to infer the full ges- 
ture. Such a paradigm can be compared to a tennis match;if we consider all 
elementary events leading up to the award of a point in a tennis match to be 
the equivalent of the elementary gestures in this work, and the tennis rules 
related to score keeping as an equivalent of the grammar-based tracking of 
the full gesture the authors have implemented, we can easily see the under- 
lying similarities between the authors’ work and reasoning on tennis video 
sequences. 

Although these examples are a small subset of the applications that these 
inference tools have been tested upon, in each case the HMM (or DBN) provides 
a useful mechanism for classifying a sequence of detected events with labels that 
have some semantic content. However we note that in general the HMMs are 
working at a single semantic level. Thus if mistakes occur, higher level informa- 
tion cannot be brought to bear to influence the decision making. (The exception 
is [2]), although it is not clear to what extent the higher- level HMM is being 
used to drive decision making in the lower level one.) In any sports activity, 
there exists a set of rules that, in the case of professional events, can more or 
less be relied upon. In particular, in the case of a professional tennis tournament 
there exists a rich and complex set of formal rules that can be structured in a 
hierarchical fashion. In the work of [4] these rules are not exploited at all, and 
in our own previous work, only rules at the lowest level, relating to the award of 
points, are used. 

3 Contextual Model 

As we have seen, a large body of work has been done in the area of creating 
decision-making schemes; one of the most important pieces of work has been the 
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introduction of Hidden Markov Models (HMMs) [7, 8] . A formal definition of a 
Hidden Markov Model would be that it is a doubly stochastic process where we 
can only observe the outcomes of one of the processes; the underlying stochastic 
process cannot be directly observed, but can only be inferred through the existing 
observations, as if it was a function of the latter. A typical example of an HMM 
could be to consider ourselves in a room and be told about the results of a die 
being rolled in another room; since we don’t know how the die is rolled, we have 
to consider this procedure as a stochastic process - as is the outcome of the die 
roll itself. Since we only know the outcome of the latter process, the system (the 
roll and the result processes) is properly described by a Hidden Markov Model. 
The notation most commonly used in the literature for the parameters in HMMs 
is the following: 

— N, the number of distinct states in the model. In the context of a tennis 
match, it could be a set of possible playing states within a point (like ex- 
pecting a hit, point to one player etc.) 

— M, the number of distinct observation symbols for each state, i.e. the alpha- 
bet size. In this context, this could include the set of possible events within 
a tennis match. 

— t, the length of the observation sequence - in the scope of this analysis, it 
would be the length of a play rally. 

— i f will be the hidden state of the HMM at time t. Hence, 1 < it < N. 

— P(i* = k) denotes the probability of the state having the label k at time t. 

— Ot will be the observation symbol at time t - equivalent to ‘what was observed 
at time t?’ 

When using HMMs to perform inference on a given problem, their most 
useful feature is their ability to calculate the most likely state sequence from an 
observation sequence. This is done by applying the Viterbi algorithm[ 9,10], an 
inductive algorithm based on dynamic programming. 

The Markov property for the sequence of hidden states ,i T and the 

related observation sequence 0 \ , • • • , O t for any r > 2 can be stated as 

P(i T |* 1 ,---,i T - 1 ) = P(* T |i T “ 1 ) 

Assuming stationarity for i T , we can write, for r > 2 and 1 < k\, k 2 < N, 
p(r = fc 1 |r- 1 = fc 2 ) = p(A:i|fc 2 ) 

Now, the previous state of the process is characterised by the (known) probability 
measure P(i T-1 |Oi, • • • , O t -\). Thus the a priori probability at time r, which 
is P(i T \Oi, ■ ■ ■ , 0 T - 1 ), will be given by 

N 

P(i T = h |0 1; • • • , 0 T _r) = ^ P(h\k 2 ) • P(r- X = fc 2 |Oi, ■ ■ • , O r -!) 

/C2 — 1 

If we also assume that that 0 1 , • • • , O r are conditionally independent, ie. that 
P(0 r |P,0l,---,0 T _l) =P(Or\i T ) 
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the a posteriori probability at time r will be given by 



P(* T |Oi,---,O r ) 



P(i T |Oi,---,Q T _i) -PjOrin 

E£Li P{Or\i T = k) ■ P(U = k\Ou • • • , Or-l) 



This probability will be the optimal decision criterion at time r and, at the same 
time, the a priori probability for time r + 1. 

The Viterbi algorithm is appropriate for discovering the sequence of states 
that will yield a minimum path cost for the sequence. However, this is the case 
where we only want to extract the current state by only looking at the current 
observation symbol. This approach is bound to yield a large number of errors if 
faced with erroneous input data; therefore, we will need to compensate for that 
fact in some way if we are to use HMMs in a real-world system. There are two 
main methods of achieving this: 



— Using multiple hypotheses for the evolution of a given sequence - in this 
case, we allow more than one event sequence to develop and be updated 
simultaneously, and we eliminate those where a transition from its previous 
state to its current one is not allowed. 

— Directly taking under consideration the fact that the system will be using 
more than one observation to perform inference for a given event while de- 
veloping the appropriate mathematical formulation of this problem. That 
means that we need to assume that the Markov processes that describe the 
problem can use the subsequent event in order to decide whether the current 
one is valid or not. This mode of decision is known in the literature as the 
look-ahead ahead decision mode for a Markov chain and is analysed in detail 
in [11]. In our context, both the observation (ie. the game event) and the state 
(ie. state of play) sequences would fall into this category. 

The baseline technique can be generalised so as to also include future obser- 
vation patterns, like O t + i, in a similar way it can be generalised for lrigher-orcler 
Markov sources. Let us assume that, in time r, a decision has to be made on 
classifying an input pattern O t . The decision can then be postponed until the 
next pattern, O r +i, is also observed. Therefore, the probability required now 
is P(i T \Oi, ■ ■ • , O t , O r+ i). For that to be calculated, P(i T \Oi, ■ • • , O t ) is also 
required; this is calculated as we have previously seen. Therefore 



P(i T |0i,---,0 T ,0 T+ i) 



P(U|Oi,---,O r )-P(O r+ i|U) 

Eti P(Ot + 1 1 U = k) ■ P(%r = k\o lr ~,0 T ) 



and, since O i, • • • , O r + 1 are independent, 



P{i T \Oi, • • • , O t , O t + i) — 

P(i T \Ol,---,Or)-P{Or+l\i T ) 

E£=i E£=i P(Or+ 1 |u +1 = h) ■ P(fcr \k 2 ) ■ P(U = k 2 |Or, • • • , O r ) 
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4 Proposed Scheme 

In our context (the analysis of tennis video sequences), the rules of the game of 
tennis provide us with a very good guideline as to what events we will have to 
be capable of tracking efficiently, so as to follow the evolution of a tennis match 
properly. Such events would include: 

— The tennis ball being hit by the players 

— The ball bouncing on the court 

— The players’ positions and shapes (that is, body poses) 

— Sounds related to the tennis match (either from the commentary or the 
court) 

These events can be used as a basis on which to perform reasoning for events 
of higher importance, like awarding the current point from the events witnessed 
during play. Based on them, the full graphical model for the evolution and award 
of a point in a tennis match is illustrated in the graph of Figure 1. 




Fig. 1 . Graphical model for awarding a point in a tennis match 



As we can readily see from this diagram, it is a graphical model where a 
number of loops exist; the state transitions drawn with bold lines indicate where 
these loops close. Moreover, this graphical model only tackles the problem of 
awarding a single point in the match; there is some more detail to be added to 
it if we wish to include the awarding of games, sets or the full match. How these 
stages are going to be implemented will be discussed in more detail later in this 
section. Finally, this figure also shows us that, in order to address the problem 
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of ‘understanding’ the game of tennis more effectively, we will have to convert 
this complex evolution graph into a set of simpler structures. 

Thus, we propose to replace the original scene evolution model with a set 
of smaller models, each one trying to properly illustrate a certain scenario of 
the match evolution. The most important thing we need to ensure during this 
procedure is that, when we combine all of the models in this set, we must have a 
model equivalent to the original one (so that it reflects the rules of tennis). The 
set of sub-graphs proposed to replace the original one is illustrated in Figure 2. 



( 1st serve ) — »( 2nd serve) ( Ace ) ( Rally » I'nintliu ;ir D 

UT T 




Fig. 2. Switching model and its respective set of sub-models for awarding a point in a 
tennis match, as separated out from the original graphical model 



As we can see from the set of models above, we have opted for a more 
‘perceptual’ way of selecting the set of event chains that will formulate the new 
model set for our purposes. This design strategy is going to be particularly 
helpful in the (quite frequent, as it is expected) cases where the system receives 
a query to extract sequences which contain some specific event; since the scene 
description will consist of a series of events occurring in the match, such queries 
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can be dealt with relatively easily. Moreover, choosing to break the initial graph 
down to a number of sub-graphs and train each one of them separately will be 
beneficial in many more different ways. Some of the resulting benefits are: 

— Using probabilistic reasoning tools (like HMMs) will allow for correction of 
wrongly detected elementary events within the game - so we can have a point 
correctly awarded even if we haven’t tracked all events accurately from the 
start. 

— Since the models are simpler, we will not have to acquire a huge training data 
set; a relatively small amount of data per model will suffice for our analysis. 
This will also make the training procedure a much less computationally 
expensive process as well - due to both the reduced volume of the training 
data and the simplicity of the models as well. 

— In some cases, we can considerably speed up the training process due to prior 
knowledge for this type of content. For example, the amount of statistics 
available for events occurring in a tennis match helps us get a very good ini- 
tial estimate for the HMM parameters without the need of a training process; 
we only need to pick up some existing measurements for this purpose. 

— It will be easier to determine which areas (ie. sub-models) of the reasoning 
engine need to be improved to boost the overall performance of the system 
and which low-level feature extraction modules are more suspect to produce 
misleading results - so that we can either improve them or replace them 
with a different approach. 

Since the detection of such elementary events will be done using machine 
vision algorithms and techniques, we are bound to encounter event detection 
errors in the process. Hence, another crucial issue is to examine the robustness 
of the proposed scheme to false events in its input. To address this, it has been 
decided that the system will be implemented through the use of Hidden Markov 
Models with a look-ahead decision mechanism , therefore allowing us to take into 
account events that occur after the one currently examined. This will help the 
system establish whether the current event can actually occur or it needs to 
be corrected and re-examined, as the rules of tennis will allow us to detect 
and correct contradictory hypotheses. The length of the look-ahead window has 
been limited to 1 event forward, thus allowing us to correct isolated errors - 
if we encounter 2 or more consecutive errors, the output will generally not be 
correct. However, there are two reasons for this choice: 

— If the length of the look-ahead window is too large, small rallies (like aces) 
with errors in them may be wrongly interpreted. That can happen as, at 
the end of the chain, the Viterbi algorithm will not have enough evidence 
to correct errors. For cases like this, we will need to implement an error- 
correcting scheme on top of the proposed method, that will rely on different 
context - as will be discussed later on. 

— Out of the events required for correct evolution tracking in this context, the 
type of events that is clearly most susceptible to errors is the ball bounces - 
the difficulty of tracking a ball in a tennis sequence stemming from the fact 
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that it is a very small, fast-moving object. However, since only one bounce 
can occur between two successive player hits if the point is still to be played, 
detecting a player hit ( not a serve) automatically removes the chance of a 
point being awarded, even if the bounce point is wrongly detected (or not 
detected at all). 

This graph will be implemented through using a ‘switch ’ variable to select 
which of the constituent sub-models will be used to model the scene at a given 
moment. Therefore, the proposed system’s structure is similar to a Switching 
HMM. As soon as we have determined which way the points are going to be 
awarded, we can move on to the award of games and sets in the match. This can 
either be done through the use of probabilistic reasoning tools (such as HMMs), 
or with simpler, rule-based tools - such as grammars. The latter is possible due 
to the fact that at this level of abstraction in modelling tennis game sequences, it 
is the rules of the game of tennis that stipulate the evolution of the scene rather 
than low-level events in it. However, since the uncertainty stemming from the 
successful (or unsuccessful) detection of the elementary events mentioned above 
cannot always be considered to have been effectively addressed in the lower-level 
stages of the reasoning process (ie. up to the level of point awarding), we will have 
to opt for using probabilistic tools to perform higher levels of reasoning for these 
video sequences. However, the fact that these models will directly correspond to 
the rules of tennis and that most of the spurious low-level data have already been 
corrected will allow for far greater confidence in the output of the higher-level 
system. 

5 Results and Discussion 

The scheme described above has been tested on one hour’s play from the Men’s 
Final of the 2003 Australian Tennis Open. The sequence contained a total of 100 
points played - equivalent to approximately one and a half sets of the match. Out 
of these 100 exchanges, a total of 36 were played on a second serve. The data that 
was used as input in this experiment were ground-truth, hand- annotated event 
chains from the broadcast match video in the first case, and the same data with 
one error per point sequence randomly inserted into it in the second. Therefore, 
we have examined both the ability of the proposed scheme to model the evolution 
of the match accurately and the robustness of it at the same time. To do that, we 
had to introduce four sets of models - one for every combination of which player 
serves and which side of the court he/she serves from (left or right). Moreover, 
in the second case, we limited the system to only check the next event for error 
correction. In those 100 exchanges, we have intentionally left in a few unfinished 
points, so as to examine whether the selection of the hidden states for these 
models can lead to an accurate representation of the scene at any given time - 
not only at the end of the scene. They were 4 in total - 2 leading to a second 
serve and 2 were cut short while still on play. An overall view of the results is 
given in Tables 1 and 2 below. 
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Table 1. Total results for Ground- Truth Data 





Ground 

Truth 


Correctly 

Awarded 


Wrongly 

Awarded 


Not awarded 
(still on play) 


Near Player Points 


59 


59 


0 


0 


Far Player Points 


37 


37 


0 


0 


Unfinished Points 


4 


4 


0 


n/a 


TOTAL 


100 


100 


0 


0 



Table 2. Total results for Data with Errors Inserted 





Ground 

Truth 


Correctly 

Awarded 


Wrongly 

Awarded 


Not awarded 
(still on play) 


Near Player Points 


59 


57 


0 


2 


Far Player Points 


37 


35 


1 


1 


Unfinished Points 


4 


3 


1 


n/a 


TOTAL 


100 


95 


2 


3 



As we can see in Table 1, all of the points were successfully tracked by the 
proposed system when ground-truth data were used, whereas in the case of errors 
in the data set (shown in Table 2), we did have some errors in recognition. This 
happened because of the fact that, in those points, the very last event on the 
chain was wrong - and since there was no ’future’ to the event sequence after that 
event, the system had no way of inferring that this was wrong. However, this is a 
predictable problem, because this event is the deciding one for the point and the 
uncertainty in it cannot be removed; therefore, the uncertainty will propagate to 
the decision for this level as well. This observation shows that the application of 
a similar error-correcting scheme on top of the proposed system (in which point 
award is to be processed) will further enhance the robustness of the system. 

6 Conclusions 

As we can readily see from the results shown above, the proposed system has 
tackled the problem of tracking the evolution of a tennis match very effectively. 
Whereas in the case where no error input has been provided, such accuracy serves 
only to illustrate that the use of such an approach is not inherently wrong; 
it is the performance of the proposed scheme when errors were added in the 
observation sequences that makes it seem a very promisimg solution for a fully 
automated evolution tracking system for tennis videos. 

However, there are still some issues to be addressed in this area. First of 
all, the aim of developing such an automatic evolution tracking scheme is for 
use as an automatic video annotation system, in conjunction with a set of fully 
automatic low-level feature extraction tools that will be able to detect the ba- 
sic events required for input to the proposed system. As in any fully automatic 
computer vision system, the low-level feature extraction tools are bound to pro- 
duce recognition errors which will propagate to the proposed decision-making 
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scheme. Therefore, and although some testing has already been carried out for 
the proposed scheme, it is essential that the proposed scheme is also tested with 
the actual low-level feature extraction tools integrated into it, so that we can 
have a clearer view of what errors occur in the lower levels and provide accu- 
rate statistics about the performance and robustness of the higher levels of the 
inference engine as a whole. 

Moreover, the proposed system is only the first step in a hierarchical model 
that will fully describe the evolution of a tennis match - it will only cover the 
award of a single point in the match. The creation of a full system will require 
the design of a similar error-correcting scheme to award games, sets and finally 
the match - and which will all rely on the efficiency of this method. 
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Abstract. Along with images and videos, 3D models have recently 
gained increasing attention for a number of reasons: advancements in 
3D hardware and software technologies, their ever decreasing prices and 
increasing availability, affordable 3D authoring tools, and the establish- 
ment of open standards for 3D data interchange. 

The ever increasing availability of 3D models demands for tools support- 
ing their effective and efficient management. Among these tools, those 
enabling content-based retrieval play a key role. 

In this paper we report on our experience in developing models to support 
retrieval by content of 3D objects. Particularly, we present three different 
models for representing and comparing the content of 3D objects. A 
comparative analysis is carried out to evidence the actual potential of 
the proposed solutions. 



1 Introduction 

Digital multimedia information is nowadays spreading through all sectors of so- 
ciety and collections of multimedia documents are being created at an increasing 
pace in several domains. However, in order to exploit the valuable assets con- 
tained in these ever growing collections, some tool should be available to support 
users in the process of finding information out of these data. In recent years, as 
a result of the efforts spent in the attempt of finding solutions to this problem, 
many systems have been developed that enable effective retrieval from digital 
libraries, covering text, audio, images, and videos. 

Beside image and video databases, archives of 3D models have recently gained 
increasing attention for a number of reasons: advancements in 3D hardware and 
software technologies, their ever increasing availability at affordable costs, and 
the establishment of open standards for 3D data interchange (e.g. VRML, X3D). 

Acquisition of the 3D model of an object, capturing both object geometry 
and its visual features (surface color and texture), can be achieved through many 
different techniques, including CAD, 3D laser scanners, structured light systems 
and plrotogrammetry. The selection of the most appropriate technique depends 
on application specific quality requirements. Furthermore, these techniques re- 
sult in a large variety of models, differing in terms of their representation (e.g. 
point clouds, voxels, analytical functions), of their resolution and size, of the 
presence, nature, and amount of noise and artifacts. 



A. Sanfeliu et al. (Eds.): CIARP 2004, LNCS 3287, pp. 13-24, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
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Thanks to the availability of technologies for their acquisition, 3D models are 
being employed in a wide range of application domains, including medicine, com- 
puter aided design and engineering, and cultural heritage. In this framework the 
development of techniques to enable retrieval by content of 3D models assumes 
an ever increasing relevance. 

This is particularly the case in the fields of cultural heritage and historical 
relics, where there is an increasing interest in solutions enabling preservation of 
relevant artworks (e.g. vases, sculptures, and handicrafts) as well as cataloguing 
and retrieval by content. In these fields, retrieval by content can be employed to 
detect commonalities between 3D objects (e.g. the “signature” of the artist) or 
to monitor the temporal evolution of a defect (e.g. the amount of bending for 
wooden tables). 

Methods addressing retrieval of 3D models can be distinguished based on 
different aspects, such as the type of representation used for geometry, the use 
of information about models’ aspect (i.e. colour and/or texture), the need for 
manual annotation. 

Description and retrieval of 3D objects based on description and retrieval of 
2D views has been addressed in [1] and [2]. However, the effectiveness of these 
solutions is limited to description and retrieval of simple objects. In fact, as 
complex objects are considered, occlusions prevent to capture distinguishing 3D 
features using 2D views. 

Description of 3D surface data for the purpose of recognition or retrieval has 
been addressed for some time. A few authors have investigated analytical 3D 
models, but this is not always a viable solution, as there are many limitations 
in providing parameterizations of arbitrary models. In [3] retrieval of 3D objects 
based on similarity of surface segments is addressed. Surface segments model 
potential docking sites of molecular structures. The proposed approach develops 
on the approximation error of the surface. However, assumptions on the form of 
the function to be approximated limit applicability of the approach to special 
contexts. 

Much attention has been recently devoted to free- form (i.e. polygonal) meshes. 
While this representation of 3D models poses major hurdles to development and 
implementation of algorithms, it is indeed the most appealing field of applica- 
tion. The system developed within the Nefertiti project supports retrieval of 3D 
models based on both geometry and appearance (i.e. colour and texture) [4]. Also 
Kolonias et al. have used dimensions of the bounding box (i.e. its aspect ratios) 
and a binary voxel-based representation of geometry [5] . They further relied on 
a third feature, namely a set of paths, outlining the shape ( model routes). In [6] 
a method is proposed to select feature points which relies on the evaluation of 
Gaussian and median curvature maxima, as well as of torsion maxima on the 
surface. In [7], Elad et al. use moments (up to the 4-7tlr order) of surface points 
as basic features to support retrieval of 3D models. Differently from the case of 
2D images, evaluation of moments is not affected by (self-)occlusions. 

In this paper we report on the use of three models for representing the con- 
tent of a 3D object for the purpose of supporting retrieval by object similarity. 
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The three models are based on projection of surface curvature information and 
spin images. Projection of surface curvature information is obtained by warp- 
ing object surface until it becomes a function on the sphere. Then, information 
about curvature is projected onto a 2D curvature map that retains information 
about curvature distribution on the original object surface. Content of the 2D 
curvature map is described using two different techniques: histograms of map 
tiles and weighted walkthroughs of map regions. 

The third model for content representation is based on spin images. These 
capture geometric properties of 3D objects regardless of surface curvature. Since 
object description based on spin images entails a huge amount of information, 
feature extraction and clustering techniques are used to meet the specific storage 
and efficiency requirements of content-based retrieval. 

The paper is organized as follows: Sec. 2 describes some pre-processing steps 
that are necessary to apply both curvature maps and spin images techniques; 
Sec. 3 expounds on computation of curvature maps from 3D object meshes; Sec. 4 
describes how to use curvature maps for description of 3D objects content; Sec. 5 
introduces extraction of spin images from 3D objects; Sec. 6 describes how to use 
spin images for description of 3D objects content; finally in Sec. 7 a comparative 
analysis among the proposed techniques is presented. 

2 Preprocessing 

High resolution 3D models obtained through scanning of real world objects are 
often affected by high frequency noise, due to either the scanning device or the 
subsequent registration process. Hence, smoothing is required to cope with such 
models for the purpose of extracting their salient features. 

Selection of a smoothing filter is a critical step, as application of some fil- 
ters entails changes in the shape of the models. For instance, mean or Laplacian 
smoothing cause shrinking of the model (a known problem, which has been 
pointed out - for example - in [6]). In Laplacian smoothing, every vertex x is 
moved from its original location by an offset A(x); the offset is determined as 
a function of the neighbouring vertices of x, and a parameter A controls the 
strength of the filter. To avoid shrinking, we adopted the filter first proposed by 
Taubin [8]. This filter, also known as X\/i filter, operates iteratively, and inter- 
leaves a Laplacian smoothing weighed by A with a second smoothing weighed 
with a negative factor (A > 0 , // < —A < 0). This second step is introduced to 
preserve the model’s original shape. 

An additional pre-processing step is employed to reduce the complexity of the 
model (in terms of the number of vertices). To this end, an algorithm perform- 
ing an iterative contraction of vertex pairs (i.e. edges) is used: first, all edges 
are ranked according to a cost metric; then, the minimum cost vertex pair is 
contracted; finally, the costs are updated [9]. The algorithm is iterated until a 
predefined stop criterion is met: In our experiments, the stop criterion was set 
in terms of the number of polygons of the final model. 
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2.1 Curvature Estimation 



Estimation of surface curvature at a generic vertex Vi of the mesh is accomplished 
by considering variations of surface normal over the platelet V Vi of vertex v,. This 
guarantees less sensitivity to noise and acquisition errors. 

In particular, surface curvature in correspondence with the i-tlr vertex ty of 
the mesh A4 is estimated by considering versor v^~, that is, the normal to A4 
at point Vi . Then , the platelet V Vi of vertex Vi is considered. This is defined 
as the set of all mesh vertices around V{. Given a generic vertex of the platelet 
Vj € V Vi let vj- be the normal to M at point Vj . Mesh curvature j Vi at vertex 
Vi is estimated as: 



£„ j6 y«i \ v i~~ v f\ 
2 \V V *\ 



It can be shown that with this definition, the value of ') Vi is always in [0, 1]. 



3 Curvature Maps 

Given a 3D object, construction of its curvature map relies on warping object 
surface until it becomes a function on a sphere and then projecting curvature 
information onto a 2D image. Mesh deformation is obtained by iteratively ap- 
plying a smoothing operator to the mesh. In general, application of a smoothing 
operator is accomplished by updating the position of each vertex of the mesh 
according to the following formula: 

M(vi)®u==i — — ^2 u>j*Vi-Vi ( 2 ) 

l.vjev-i "'.i v . eV ^ 

being weights u = {u>j} characteristic of each operator and p a parameter used 
to control the amount of motion of each vertex and to guarantee stability and 
continuity of the smoothing process. 

Under the assumption of low p values, the iterative application of the smooth- 
ing operator to every vertex of the mesh is equivalent to an elastic deformation 
process. During the deformation process each vertex of the mesh should be moved 
in order to satisfy two sometimes opposite requirements: mesh regularization 
and curvature minimization. As demonstrated in previous work [10], applica- 
tion of Laplacian Smoothing , Taubin Smoothing , or Bilaplacian Flow operators 
increases mesh regularization but may result in unnatural deformations of the 
original mesh. Differently, application of Mean Curvature Flow operator doesn’t 
guarantee mesh regularization. 

To achieve both regularization and smoothing of the original mesh, the pro- 
posed solution develops on the application of two distinct operators at each 
step of the iterative deformation process. In particular, Laplacian and Gaussian 
smoothing operators are used in combination to achieve both mesh smoothing 
and regularization. Application of the two operators is iterated until the average 
value of vertex motion falls below a predefined threshold r. 
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3.1 Mapping 

Projection of a curved surface is a well known problem in cartography [11]. There 
are many different projections used to map (a part of) the globe onto a plane, 
but their description is far beyond the scope of this paper. In our approach, we 
have selected the Archimedes projection (also known as the Lambert equal-area 
projection). Similarly to the Mercator projection, the Archimedes projection is a 
cylindrical projection. In particular, it is the projection along a line perpendicular 
to the axis connecting the poles and parallel to the equatorial plane. Thus, a 
point on the sphere with latitude 0 and longitude is mapped into the point 
on the cylinder with the same longitudinal angle 0 and height sin(<P) above (or 
below) the equatorial plane. 

A major advantage of the Archimedes projection is that it is an area preserv- 
ing projection: all regions on the surface of the sphere are mapped into regions 
on the map having the same area. This guarantees that, regardless of the posi- 
tion on the sphere, the relevance of any region is the same both on the sphere 
and on the map. 

4 From Curvature Maps to Content Descriptors 

Ideally, once a 3D model is represented through a 2D curvature map, any ap- 
proach supporting image retrieval by visual similarity could be used to evaluate 
the similarity between two 3D models. In fact, this can be achieved by computing 
the similarity of the corresponding maps. 

In the proposed approach, information about curvature maps is captured 
at two distinct levels: tiles obtained by a uniform tessellation of the map, and 
homogeneous regions obtained by segmenting the map. In the former case, we use 
histograms to capture global properties of map tiles, whereas in the latter case 
we rely on weighted walkthroughs to describe spatial arrangement and local 
properties of regions on the map. Details on the two techniques are provided 
hereafter. 

4.1 Histogram-Based Description of Map Tiles 

A generic histogram H with n bins is an element of the histogram space 74" C 
3R”. Given an image and a quantization of a feature space, histogram bins count 
the number of occurrences of points of that quantized feature value in the image. 

Histograms also support a multi-resolution description of image features. 
Given a partitioning of an image into n fine-grained tiles, histograms provide a 
representation for the content of each of these tiles. 

In order to compute the similarity between two histograms, a norm must 
be defined in the histogram space. In our experiments the Kolmogorov- Smirnov 
distance was adopted. Thus, the distance between two histograms H and H' is 
computed as follows: 

T> ks (H, H') = ma x((q, h z ) (3) 

i 

being hi and h *-tlr element of the cumulated histogram of H and H\ respec- 
tively (i.e. hi = X)fe =1 M- 
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Computing the distance between two maps requires to find the best tiles 
correspondence function. This is defined as the permutation p : {l,...,n} — ■> 
{1, . . . , n} that minimizes the sum of distances between corresponding tiles. 

The solution p is approximated through a greedy search approach that re- 
quires to scan all tiles in the first map in a predefined order and associate to 
each tile the most similar tile not yet associated in the second map. This pairwise 
NN association yields a suboptimal solution. 

4.2 Weighted Walkthroughs Description of Map Tiles 

Description of map content through histograms is not able to capture neither the 
spatial arrangement nor the local properties of individual regions of the map. In 
some cases this can be a limitation, since information about individual regions 
and their spatial arrangement in the map is strictly related to information about 
shape and structure of the original 3D mesh. To overcome these limitations, 
the coarse description of map content is complemented with a local approach 
capturing local properties of individual regions in the map as well as their spatial 
arrangement. 

Local description of map content is based on weighted walkthroughs tech- 
nique [12]. In particular, description of map content is accomplished by seg- 
menting the map into regions characterized by uniform curvature values. For 
each region, information about region area and average curvature is retained. 
Furthermore, for each pair of regions, their relative position is captured through 
a 3 x 3 array corresponding to the weighted walkthroughs for the two regions. 

The use of weighted walkthroughs enables description of map content in the 
form of an attributed relational graph. Graph vertices correspond to regions of 
the map and are labelled with the region’s area and average curvature. Graph 
edges retain information about the relative position of regions they link and are 
labelled with the corresponding 3x3 weighted walkthroughs. 

The descriptor of content of a generic map can be represented as ( R,f,w ), 
being R the set of regions in the map, / the set of visual features capturing the 
appearance of each region (in our case region area and average curvature), and 
w the set of weighted walkthroughs capturing the relative position of each region 
pair. 

Computation of the similarity between two descriptors of map local content 
is equivalent to an error correcting subgraph isomorphism problem [13], which 
is an NP-complete problem with exponential time solution algorithms [14] . 

In the proposed approach, identification of the optimal node association func- 
tion is accomplished through the technique presented in [15]. This is based on a 
look-ahead strategy that extends classical state-space search approaches. 

5 Spin Images 

Spin images were introduced by Johnson and Hebert to support recognition of 
single objects in complex scenes [16]. Basically, spin images encode the density 
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of mesh vertices projected onto an object-centred space: the three-dimensional 
mesh vertices are first mapped onto a two-dimensional space defined w.r.t. to the 
object itself; the resulting coordinates are then used to build a two-dimensional 
histogram. 

More precisely, let O = (p, n) an oriented point on the surface of the object, 
where p is a point on the surface of the object and n the normal of the tangent 
plane in p. For a generic oriented point O , a spin map can be defined, which 
maps any point x in the three-dimensional space onto a two-dimensional space 
according to the following formula (see also Fig. 1 for notation): 

So(x) -*• [a,/3\ = W\\x-p\\ 2 - (n- (x - p)) 2 , n ■ (x - p)\ 

In other words, the oriented point defines a family of cylindrical coordinate 
systems, with the origin in p, and with the axis along n. The spin map projection 
of x retains the radial distance (a) and the elevation (/3), while it discards the 
polar angle. 

To produce a spin image of an object, a spin map is applied to points compris- 
ing the surface of the object. Hence, given a mesh representation of the object, 
the spin image can be obtained by applying the map to the vertices comprising 
the mesh. A simple binary image can be obtained by discretizing the projected 
coordinates and by setting the corresponding point on the image. However, more 
refined grey-level spin images encoding a measure of the density of vertices that 
insist upon the same image point are usually employed. To construct such an 
image, the projected coordinates a and j3 of each mesh vertex are used to up- 
date the two-dimensional histogram I(i,j) (i.e. the spin image) according to a 
bi-linear interpolation scheme that spreads the contribution of each vertex over 
the nearest points on the grid induced by the quantization of the image space. 

Most outstanding characteristics of spin images are invariance to rigid trans- 
formations (as a consequence of the adoption of an object-centred coordinate 
system), limited sensitivity to variations of position of mesh vertices (which 
might result from the adoption of different sampling schemes), flexibility (since 
no hypotheses are made on the surface representation), and ease of computation. 




Fig. 1 . Given an oriented point (p, n) on the object surface, a generic point x is mapped 
on point [a, 0\ on the spin map, being [a, (3\ the radial distance and the elevation of 
x w.r.t. to ( p,n ). a) the object centred 3D coordinates system, and b) the spin map 
coordinate system. 
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6 From Spin Images to Content Descriptors 

Spin images provide a powerful means to describe three-dimensional objects. 
However, the fact that many spin images are typically produced for a single 
object, and the fact that each image implies considerable storage requirements, 
prevent us to use them directly as descriptors for retrieval purposes. Therefore, 
we decided to rely on more compact descriptions extracted from spin images, 
synthesizing the content of each spin image. Our descriptor for spin images was 
inspired by region descriptions such as grid-based techniques or the shape ma- 
trix [17]. Instead of sampling the shape at the intersection between radial and 
circular lines, we decided to measure the relative density encompassed by each 
of the regions defined by those lines, so as to provide a more precise description 
of the spin image. We have defined three independent sets of regions for the spin 
image: sectors of circular crowns for both the half-plane with (3 > 0 and the half 
plane with (3 < 0, and circular sectors. Each of these sets defines a descriptor 
{C p = (cp 1 ,...,cp np ), C n = (cni, cn nn ), and S = (si, . . . , s ns ), respec- 
tively), whose components represent the amount of surface points (or vertices) 
whose projections fall within the corresponding crown/sector. 

Based on results of some preliminary experiments we chose np = nn = ns = 6 
as these represent a satisfactory trade-off between compactness and selectivity 
of the representation. Hence, a 18-dimensional descriptor D = (C p ,C n ,S) is 
evaluated for each spin image. 




Fig. 2. Compound object descriptors comprise descriptors for a) np crowns in the half- 
plane /3 > 0, b) nn crowns in the half-plane (3 < 0, c) ns sectors. In our experiments 
np = nn = ns = 6. 



In order to avoid use of one spin image descriptor for each mesh vertex, 
spin image descriptors are subject to clustering. For this purpose we relied on 
fuzzy clustering [18], which is an extension to c-means procedure that avoids 
partitioning feature vectors into hard or crisp clusters. Through clustering, we 
represent the original set of spin image descriptors {SI\, . . . , SI m } ( Sit € IR 18 ) 
- each descriptor being associated with one mesh vertex - with a compact set 
represented by the clusters’ centers. 

Computation of the similarity between two 3D objects is accomplished by 
comparing their descriptors, each descriptor being in the form of a set of cluster 
centers T> = {Di, i = 1 . . .}. 
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Fig. 3. Retrieval of Mercur statues using spin images. All models of the Mercur statues 
are retrieved first, followed by models of other statues which display similar shapes. 



Computing the distance A between two descriptors T> and T> requires to find 
the best cluster-center correspondence function. This is defined as the permu- 
tation p : {1, . . . , 1} — » {1, . . . , k} that minimizes the sum of distances between 
corresponding cluster centers. 

7 Comparative Analysis 

Approximately 300 models were collected to build the test database. These com- 
prise four classes of models: taken from the web, manually authored (with a 3D 
CAD software) , high quality 3D scans from the De Espona 3D Models Encyclo- 
pedia 1 , and variations of the previous three classes (obtained through geometric 
deformation or application of noise, which caused points surface to be moved 
from their original locations) . 

Fig. 3 shows a retrieval example using spin images. The query is the model 
of a statue portraying Mercur. The result set displays all models of the Mercur 
statue in the first five positions. In general, all retrieved models feature similar 
shapes, characterized by a main body and protrusions that resemble Mercur’s 
elongated arm and leg. Fig. 4 shows a retrieval example where the model of a 
Satyr’s bust is used as a query. The result set displays all models of the Satyr’s 
bust in the first five positions. Other models reproducing busts are also retrieved. 

Among the different techniques reviewed in Section 1, we selected the cur- 
vature histograms [19] and moments of surface points [7] for a comparative as- 

http : //www.deespona. com 
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Fig. 4. A retrieval example, using the model of a Satyr’s bust as the query. Other 
models of the Satyr’s bust are retrieved first, followed by models of other busts. 




Recall 

Fig. 5. Comparison of precision/recall figures for the four methods: curvature his- 
tograms, moments, weighted walkthroughs of curvature maps and spin images. 
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sessment. Performance comparison is assessed through four sample queries that 
were submitted to each of the four retrieval engines. Average precision vs. recall 
curves are shown in Fig. 5. 

The comparative evaluation shows that retrieval based on spin images per- 
forms better than all the other three approaches. In particular, performance of 
approaches based on curvature histograms and 3D moments is particularly crit- 
ical. This may be accounted to the fact that these two methods only provide a 
global description of the object, and this is often unappropriate for discrimina- 
tion of different models. 
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Abstract. In this paper the authors use the framework of geometric 
algebra for applications in computer vision, robotics and learning . This 
mathematical system keeps our intuitions and insight of the geometry 
of the problem at hand and it helps us to reduce considerably the com- 
putational burden of the problems. The authors show that framework 
of geometric algebra can be in general of great advantage for applica- 
tions using stereo vision, range data, laser, omnidirectional and odometry 
based systems. For learning the paper presents the Clifford Support Vec- 
tor Machines as a generalization of the real- and complex-valued Support 
Vector Machines. 



1 What Is Clifford Geometric Algebra? 

Let Q n denote the geometric algebra of n-dimensions - this is a graded lin- 
ear space. As well as vector addition and scalar multiplication we have a non- 
commutative product which is associative and distributive over addition - this is 
the geometric or Clifford product. A further distinguishing feature of the algebra 
is that any vector squares to give a scalar. The geometric product of two vectors 
a and b is written ab and can be expressed as a sum of its symmetric and anti- 
symmetric parts ab = ab + aAb. The outer or wedge product of two vectors is a 
new quantity which we call a bivector. We think of a bivector as a oriented area 
in the plane containing a and 6 , formed by sweeping a along b. The outer prod- 
uct is immediately generalizable to higher dimensions - for example, (aAb)Ac, a 
trivector , is interpreted as the oriented volume formed by sweeping the area aAb 
along vector c. The outer product of k vectors is a fc-vector or fc-blade, and such 
a quantity is said to have grade k. A multivector (linear combination of objects 
of different type) is homogeneous if it contains terms of only a single grade. 

In an n-dimensional space we can introduce an orthonormal basis of vectors 
{<Tj}, i = 1 such that = Sij. This leads to a basis for the entire 

algebra: 

1, {o-j}, {CTiAo-j}, {<TiAajAa k }, ■■■, a 1 Aa 2 A...Aa n . (1) 



A. Sanfeliu et al. (Eds.): CIARP 2004, LNCS 3287, pp. 25-36, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
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Note that the basis vectors are not represented by bold symbols. Any multivector 
can be expressed in terms of this basis. In this paper we will specify a geometric 
algebra Q n of the n dimensional space by G P , q ,ri where p, q and r stand for 
the number of basis vector which squares to 1, -1 and 0 respectively and fulfill 
n=p+q+r. Its even subalgebra will be denoted by Gp q r - For example Qq 2 o ^ ias 
the basis 

{1, cr l5 og, cqAo^}, (2) 

where af=-l,a 2 =-l. This means p= 0, q= 2 and r=0. Thus the dimension of this 
geometric algebra is n=p + q + r— 2. 

In the n-D space there are multivectors of grade 0 (scalars), grade 1 (vec- 
tors), grade 2 (bivectors), grade 3 (trivectors), etc... up to grade n. Any two 
such multi vectors can be multiplied using the geometric product. Consider two 
multivectors A r and B s of grades r and s respectively. The geometric product 
of A r and B s can be written as 

A r B s = (AB) r+s + (AB) r+s _ 2 + . . . + (AB) |r _ g| (3) 

where (M) t is used to denote the t-grade part of multivector M , e.g. consider 
the geometric product of two vectors ab = (ab) o + ( ab } 2 = a ■ b + a A b. 

For an detailed introduction to geometric algebra the reader should resort to 

[1-3]- 

2 Body-Eye Calibration 

The so-called hand-eye calibration problem involves the computation of the 
transformation between a coordinate system attached to a robotic hand and 
the camera on top of it. Since we want calibrate a binocular head with a mobile 
robot, from know on we will call this task as the body-eye calibration problem. 

The robot-to-sensor relation can be seen as a series of joints Ji, J 2 , J n 
(where a rotation about joint Ji affects all joints A+i, ..., J n ) and a measurement 
system U which is rigidly attached to the last joint J n . The problem can be stated 
as the computation of the transformations M i, M 2 , ..., Af n _i between the robot 
frame and the last joint and the transformation M n between the last joint and 
the measurement device U, using only data gathered with U . The procedure 
consists of two stages. The first stage computes the screw axes of the joints, and 
the second stage uses these axes to compute the final transformation between 
the coordinate systems. 

2.1 Screw Axes Computation 

To compute the axes of rotation, we use a motion estimator, for details see [2] . 
Each joint Ji is moved in turn while leaving the rest at their home position 
(see Figure l.a) . From the resulting motor Mi, the axis of rotation Si can be 
extracted, , for details see [2]. For our particular robot, the sequence of motions 
is presented in Figure l.b. 
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Fig. 1. (a) [upper row) Estimation of the screw axes, (b) [lower row) Correction of 
the rotation and relocation of the screw axes. 



2.2 Calibration 



Our algorithm will produce a set of lines Si in the camera’s coordinate system. 
Once these axes are known, the transformation taking one point x^ measured 
in the camera’s framework to the robot’s coordinate system is easy to derive, 
provided that we know the angles an applied to each joint Ji. Basically, the algo- 
rithm undoes the implicit transformations applied on the camera’s framework by 
first rotating about joint Jk and then translating the joint (and the framework, 
along with the rest of the joints) to the origin (see Figure l.b). 

The functions used in our algorithm are defined as follows: 



nearest(cc) = — 



(e • x) ■ x 



e ■ [(e • x) ■ x] 

. t _ 

makeTranslator(t) = 1 + — e, 

Oi Ot 

lineToMotor(_L a , a) = cos( — ) + sin( — )[e A [e^m) - (ei 2 3 n)\, 



( 4 ) 

( 5 ) 

(6) 



where L a = m + e A n, and n = 1. The function nearest (a:) returns the point 
on x which is nearest to the origin, makeTranslator(t) returns a translator 
displacing by an amount t, and lineToMotor(T a , a), simply returns a motor 
that rotates a radians about the axis L a . 
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Fig. 2. (a) Reconstruction without calibration, (b-d) Relocation of the screws, 
(e) Comparison of the final reconstruction with the real view. 



3 Inverse Kinematics and Object Manipulation 

In this section we show how to perform certain object manipulation tasks in the 
context of conformal geometric algebra. First, we solve the inverse kinematics for 
a Pan-Tilt unit so that the binocular head will be able to follow the end-effector 
and to position the gripper of the arm in a certain position in space. Then, we 
we show how to grasp an object in space. 



3.1 Inverse Kinematics for a Pan-Tilt Unit 

In this task we apply a language of spheres for solving the inverse kinematics, 
this can be seen as an extension of an early approach [1], when a language of 
points, lines and planes were used instead. 

In the inverse kinematics for a pan-tilt unit problem we aim to determine 
the angles 9 tut and 9 pan of stereo-head, so that the cameras fix at the point pt- 
We will now show how we find the values of 9 pan and 9 t ut using the conformal 
approach. The problem will be divided in three steps to be solved. 

Step 1 : Determine the point p 2 . 

When the 9 t ut rotes and the bases rotate ( 9 pan ) around the l y (see Fig. 3. a), the 
point p 2 describes a sphere si. This sphere has center at the point p\ and radius 

d'2- 




( 7 ) 
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Fig. 3. a) Point P2 given by intersection of the plane m and the spheres si and S2. (b) 
Algorithm simulation showing the sphere containing the cube, (c) Image of the object 
we wish to grasp with the robotic arm. (d) The robot “Geometer” grasping a wooden 
cube. 



Also the point pt. can be locked from every point around it. that is the point p 2 
is in the sphere: 




Where d 3 is the distance between point pt and the cameras, and we can calculate 
d 3 using a Pythagorean theorem d\ = D 2 — d 2 , where D is the direct distance 
between p t and p\. We have restricted the position of the point p 2 , but there is 
another restriction: the vector going from the P 2 to the point pt must be live at 
the plane 7Ti generated by the l y axis (l* = po A pi A e^ ) and the point p tl as 
we can see in Fig. 3. a So that P 2 can be determined by intersecting the plane 7Ti 
with the spheres Si and s 2 as follows 

nl = l*Ap t , Pp 2 = si A m A s 2 . ( 9 ) 

Step 2 : Determine the lines and planes. 

Once P 2 have been determined, the line Z 2 and the plane 7 t 2 can be defined. This 
line and plane will be useful to calculate the angles 9 t ut and 9 pan . 



l * 2 =pi Ap 2 A eoo, 



K Ae 3 - 



(10) 
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Step 3: Find the angles Quit and 9 pa n- 

Once we have all the geometric entities, the computation of the angles is a trivial 
step. 



cos {9 pan) 




COS {9 tilt) 



/* /* 
J 1 ' V 




( 11 ) 



3.2 Grasping an Object 

Other interesting experiments involve tasks of grasping objects. First, we con- 
sider only approximately cubic objects (i.e., objects with nearly the same width, 
length, and height). We begin with four non-coplanar points belonging to the 
corners of the object and use them to build a sphere. With this sphere, we can 
make either a horizontal or transversal section, so as to grasp the object from 
above, below, or in a horizontal fashion. Figure 3.b shows the sphere obtained 
using our simulator; the corners of the cube are shown in Figure 3.c; and Figure 
3.d shows the robot arm moving its gripper toward the object after computing 
its inverse kinematics. 



4 Learning: Clifford Valued Support Vector Machines 

In this section we will present the Clifford valued Support Vector Machine 
(CSVM). A CSVM will be a multivector generalization of the real and com- 
plex valued Support Vector Machines [4]. 

4.1 Linear Clifford Support Vector Machines for Classification 

For the case of the Clifford SVM for classification we represent the data set in 
a certain Clifford Algebra G n where n = p + q + r, where any multivector base 
squares to 0, 1 or -1 depending if they belong to p, r, or s multivector bases re- 
spectively. Each data ith-ve ctor has multivector entries Xi = [xa,Xi 2 , ...,XiD] T , 
where x t j £ Q n and D is its dimension. Thus the ith-ve ctor dimension is Dx2 n . 
Each data ith-ve ctor x t £ Q® of the N data vectors will be associated with 
their labels as follows: {x a , y il ),(x l 2 , y i2 ),...,(x tj , y i:j ),...,(x. lD , y iD ), where 
each y ^ = y ijs + y ijai + y ij(T2 + ... + £ {±1 ±a 1 ± <j 2 ... ± I}, where the first 

subindex s stands for scalar part. The 2 n classification problem is to separate 
these multivector-valued samples into 2" groups by selecting a right function 
from the set of functions {f(x) = w* T x + b, x,w £ Q® , b £ Q® . The optimal 
weight vector will be 

w = [w 1 ,w 2 ,...,w d ] t £ g°. (12) 

Let us see in detail the last equation 

f(x) = w* T x + b= [w\,w* 2 , ...,w* D \[(xi,x 2 , •••, x D ]' r + b 

D 

= '52 w i x i + b ’ 

i= 1 



(13) 
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where w*Xi corresponds to the Clifford product of two multivectors and w* is 
the conjugated of the multi vector w. 

We introduce now a structural risk functional similar to the real valued one 
of the SVM and use loss function similar to the function insensitive f of Vapnik. 

1 1 

min -w* T w + C • (14) 

J-i 

subject to Coef s {yij)Coef s {f(xij)) > 1 - Coef s (£ij) 

Coef Ul {y ij )Coef <Jl {f{x ij )) > 1 - Coef ai (£ij) 

C'Oefi(yij)Coef I (f(xi j )) > 1 - Coef^&j) 

CoefsHij ) > 0, Coef ai (£ij) > 0, Coefifaj) > 0, j = 1, 

where the subindex i = 1, D. 

The dual expression of this problem can be derived straightforwardly. Firstly 
let us consider the expression of the orientation of optimal lryperplane. 

Since the Wi = [wn,Wi 2 , ..., w i £>] T , each of the w,j is given by the multivec- 
tor 

Wij = w is + w iai ai + ... + w icrn a n + w i<7l(72 aia 2 + ... + wul. (15) 
Each component of these weights are computed as follows: 

i i 

VJis ^ v ((^zs )j 1 ^ ' ((^b<7i (.Viai )j ) (*^h(7i )j ••*; 

i=i j= i 

i 

Wil = 'y ( (16) 

J=1 

According the Wolfe dual programing [4] the dual form reads 

1 i D 

min ^(w* T w)-'^2(^2({a is ) j + ... + (a i<Tl<72 )j + ... + (a iI ) j )j (17) 
j= 1 i=l 

subject to a T ■ 1 = 0, where the entries of the vector 

CL [O s , 0(7^ , 0(7 2 , ■•., 0(7 1( 72, O/] 

are given by 

= [[(ai»i)(yisi), (a2si)(y2si), ..., {a D si)(yD S1 )], ■■■, 

[( a lsi)(2/lsi)i ( U2sl){y2sl ), ( a Ds l){jjDsl )] , 

[(o^lcri \)(yicr\ i)? (^2cri 1 )(y2cri 2 ), •••? (^£><7! ) (l/Dcrx 1 )] , •••, 




( 18 ) 
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[(aio-i;)(j/l<T 1 ;), (02 <t 1 ;)(2/2ct 1 (), (oDcti ; ) (j/DtJ! j )] 



T 

a I = 



[(ai/l)(3/l/l), (a2/l)(l/2/i), (a_D/i)(j/D/i)], 



[(aui)(yiii), (a 2 ii)(y2ii), (oD/z)(2to/i)] , 



(19) 



note that each data ith-ve ctor, i = 1, ...IV, has D multivector entries and after 
the training we take into account not N but l ith-ve ctors which is the number of 
the found support vectors each one belonging to Q® . Thus a T has the dimension: 
(D x l) x 2", the latter multiplicand corresponds to the length of a multivector 
of Ga- 
in a T -1 = 0,1 denotes a vector of all ones, and all the Lagrange multipliers 
should fulfill 0 < ( a is )j < C, 0 < (a i<7l )j < C, ..., 0 < (a ia 1<T2 )j < C, ..., 
0 < (iai)j < C for i = 1, D and j = 1, 1. 

We require a compact an easy representation of the resultant GRAM matrix 
of the multi-components, this will help for the programing of the algorithm. For 
that let us first consider the Clifford product of ( w* T w ), this can be expressed 
as follows 



w w = (w w) 



II- 



(20) 



Since w has the components presented in equation (16), the equation (20) can 
be rewritten as follows 



w*w = a* s ( x*x) s a s + ... + a* s (x*x) 



a *l 1 ( X * X )s a s 



a r (x x) a s 



a *l 1 ( X * X )a 1 a 2 a ^2 



*T 



a s {x xjjdi 
x*x) T aj + 



dj{x*x) <7i a a i 



a r (x x 



(21) 

+ - + a * T i (x*x) iai . 



Renaming the matrices of the t-grade parts of (x*x) t , we rewrite previous equa- 
tion as: 



w*w = a* T s H s a s T a*^H ai a ai + ... + H ai(J . 2 a rJia2 + ... + a^Hjaj + 
d ai H s a s T a ai H ( y 1 a ai T a CTl G2 a ai a2 T ...To ai Hjdj T 



a* j H s a s T a* j H ai a ai + ... + a* j H ailT2 a ai <j 2 + ... + a* j H jaj. (22) 

These results help us finally to rewrite equation (17) as a compact equation 
as follows 

1 1 i 1 

min -w* T w + C -^2$ = -a* T Ha + C (23) 
i= i l=i 

subject to Coef s (yij)Coef s (f(xij))>l-Coef s (£ij), (24) 

where a is given by equation (18). 
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H is a positive semidefinite matrix which is the expected Gramm matrix. 
This matrix in terms of the matrices of the t-grade parts of (x*x) t is written as 
follows: 



II = 



hh^h , t , 2 



H (t\(T 2 •••Hi 



H 

H 



T 

<72 



II ... H ,, // ,,„ 2 ... //,//_ 

Hl 1 H s ...H tT1 ' T2 ..H,H S H„ X 



(25) 






note that the diagonal entries equal to H s and since H is a symmetric matrix 
the lower matrices are transposed. 

The optimal weight vector w is as given by equation 12. 

The threshold b € Q ® can be computed by using KKT conditions with the 
Clifford support vectors as follows 



b = \bib 2 b 3 ...b D 

( bi s + bi ai (Ti + ... + bi aicr2 ai<j2 + + bul) 

(&2s + b2<j 1 o’i + ... + &2<ri <T 2 a i Vi + + b 2 il)-.. 

(bos + bDax&l + + frz)<Ti<72 tT l (7 2 + + boil) 

l 

= - w * Tx i)/ 1 - 

i=i 



(26) 



The decision function can be seen as sectors reserved for each involved class, 
i.e. in the case of complex numbers (f?i,o,o) or quaternions (t/ 0 , 2 , 0 ) w e can see 
that the circle or the sphere are divide by means spherical vectors. Thus the 
decision function can be envisaged as 



y = csigrim 




= csign m 



w* T x + b 



1 

= csigrim °yj)( x f x ) + b \- 

i=i 



(27) 



where m stands for the state valency, e.g. bivalent, tetravalent and the operation 
“o” is defined as 



(ctj o yj) =< aj > 0 < y :j >0 + < ctj >i< y 3 >1 <J\ + 

...+ <a j > 2 n<y j >2n I, (28) 

simply one consider as coefficients of the multivector basis the multiplications 
between the coefficients of blades of same degree. For clarity we introduce this 
operation “o” which takes place implicitly in previous equation (16). 

Note that the cases of 2-state and 4-state (Complex numbers) can be solved 
by the multi-class real valued SVM, however in case of higher representations like 
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the 16-state using quaternions, it would be awkward to resort to the multi-class 
real valued SVMs. 

The major advantage of this approach is that one requires only one CSVM 
which even can admit multiple multivector inputs. A naive and time consuming 
approach will be to use a a set of real valued SVM. 

4.2 Nonlinear Clifford Valued Support Vector Machines 
for Classification 

For the nonlinear Clifford valued classification problems we require a Clifford 
valued kernel k (x,y). fn order to fulfill the Mercer theorem we resort to a 
component-wise Clifford- valued mapping 

x £ G n — > <P(x) = <2> s (a;) + ( P ai ai + (29) 

... + a 2 (x)a 2 + ... + I<Pi(x) £ G n . 

In general we build a Clifford kernel k{x m ,Xj) by taking the Clifford product 
between the conjugated of x m and Xj as follows 

k(x m ,Xj) = $(x)*$(x), (30) 

note that the kind of conjugation operation ()* of a multivector depends of the 
signature of the involved geometric algebra Q p , q , r - 

Quaternion- valued Gaussian window Gabor kernel function (we use here i = 
0203) 3 = — 0301) fe = 0402 ): 

The Gaussian window Gabor kernel function reads 

k(x m , x n ) = g(x m , x n )exp^ lv, ° {Xm ^ Xn) (31) 

where the normalized Gaussian window function is given by 

. x 1 \\X m -X n \£ 

g{x m ,x n ) = exp ^ (32) 

V 27 Tp 

and the variables Wq and x m — x n stand for the frequency and space domains 
respectively. 

As opposite as the Hartley transform or the 2D complex Fourier this kernel 
function separates nicely the even and odd components of the involved signal, 

i.e. 

k(Xni) X n ) — /^(Xtt,, X n ^ s -f- k(x m: 3 ?n)<T 2 <T 3 “h k[Xmi ^71)03(71 k(Xrm *Kn)cricr2 

= g(xm,x n )cos(wo Xm)cos(wo Xm) + g(x m , Xn)cos(-Wo Xm)sin{vjQ Xm)i 
+g(xm,x n )sin(wo x m )cos(\Vo Xm.)j + g(x m , x„)sm(wj X m )sin(\VQ Xm)k. 

Since g(x m , x n ) fulfills the Mercer’s condition it is straightforward to prove that 
k(x m ,x n ) u in the above equations satisfy these conditions as well. 

After we defined these kernels we can proceed in the formulation of the SVM 
conditions. We substitute the mapped data <P(x) = < ${x) > u into the 
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linear function f(x) = w* T x + b = w* T <P(x) + b. The problem can be stated 
similarly as equations (15-17). In fact we can replace the kernel function in 
equations 24 to accomplish the Wolfe dual programming and thereby to obtain 
the kernel function group for nonlinear classification 

H s = [k s (x rn ,Xj)] m j_ l l (33) 

H ui = [k < 7 l {x rn iXj)\ m j_^ ; 



Hi — [ki{ x m,Xj)] m j = i t ^i ■ 

In the same way we use the kernel functions to replace the the dot product 
of the input data in the equation (27). In general the output function of the 
nonlinear Clifford SVM reads 



y = csigrim 


= csigrim 


w* T <P(x) 


l 

= csigrim^y^Xaj 0 

3 = i 


; j,x ) + b 



(34) 



where m stands for the state valency. 

Next we present the well known 2-D spiral problem to the 3-D space. This 
experiment should test whether the CSVM would be able to separate three 1-D 
manifolds embedded in R 3 . In Figure 4 on can see that the problem is nonlinear 



6 '1 

4 ^ 




Fig. 4. 3D spiral with three classes. The marks represent the support vectors found by 
the CSVM. 
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separable. The CSVM used the kernel given by the equation 33. The CSVM 
found 16 support vector for 21 support vector for / 2 (f) (in the middle) 

and 16 support vector for / 3 (f). Note that the CSVM indeed manage to separate 
the three classes. If we think in a real valued SVM (naive approach), one will 
require to do the job three SVMs. 

5 Conclusions 

In this article we have chosen the coordinate-free system of Clifford or geometric 
algebra for the analysis and design of algorithms useful for perception, action 
and learning. 

Future intelligent machines will necessarily require of a powerful geometric 
language for reasoning at high level. The author believes that Clifford geometric 
algebra is a promissory mathematical system for representing and processing 
complex geometric data in real time. 
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Abstract. In this paper we propose a new technique to perform figure- 
ground segmentation in image sequences of scenarios with varying illu- 
mination conditions. Most of the algorithms in the literature that adapt 
color, assume smooth color changes over time. On the contrary, our tech- 
nique formulates multiple hypotheses about the next state of the color 
distribution (modelled with a Mixture of Gaussians -MoG-), and val- 
idates them taking into account shape information of the object. The 
fusion of shape and color is done in a stage denominated ’sample con- 
centration’, that we introduce as a final step to the classical CONDEN- 
SATION algorithm. The multiple hypotheses generation, allows for more 
robust adaptions procedures, and the assumption of gradual change of 
the lighting conditions over time is no longer necessary. 



1 Introduction 

Color is a visual cue that is commonly used in computer vision applications, such 
as object detection and tracking tasks. In environments with controlled lighting 
conditions and uncluttered background, color can be considered a robust and 
invariant cue, but when dealing with real scenes with changing illumination and 
confusing backgrounds, the apparent color of the objects varies considerably 
over time. Thus, an important challenge for any tracking system to work in real 
unconstrained environments, is the ability to accommodate these changes. In 
the literature, the techniques that cope with change in color appearance can 
be divided in two groups. On the one side, there is a group of approaches that 
search for color constancy (e.g. [2]). But in practice, these methods work mostly 
on artificial and highly constrained environments. On the other hand, there are 
the techniques that generate a stochastic model of the color distribution, and 
adapt this model over time, usually based on weighting functions of previous 
color distributions [6] [7] [8]. The drawback in all of these approaches is that they 
assume that color varies slowly and that it can be predicted by a dynamic model 
based in only one hypothesis. However, this assumption is not enough to cope 
with general scenes, where the dynamics of the color distribution might follow 
an unknown and unpredictable path. 

* This work was supported by CICYT projects DPI2001-2223 and DPI2000-1352-C02- 
01, and by a fellowship from the Spanish Ministry of Science and Technology. 
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Fig. 1. Example frames of a sequence with time- varying color illuminant and its cor- 
responding color distributions (in normalized rg color space). 



In order to cope with these drastic changes, we propose a framework that 
uses multiple hypotheses about the future state of the color distribution. In 
a previous work [5] , we have suggested a similar multihypotheses framework to 
track objects in which color could be approximated by an unimodal distribution, 
represented by a histogram. The main contribution of the present paper consists 
of applying the fusion of shape and color information in a final stage of the 
CONDENSATION algorithm (that we call ‘sample concentration’) in order to 
deal with multicolored objects. To achieve this, the color of the object (and 
background) has been approximated by a MoG , which number is automatically 
initialized by an unsupervisecl algorithm. At each iteration, an offline learned 
dynamic model will generate the hypotheses about probable future states of the 
Gaussian mixture, that will be weighted depending on ‘the quality’ of the a 
posteriori probability map of the object computed with each of them. 

A detailed description of the method is explained in the following sections. In 
Section 2 the object color model and initialization is given. The process of adjust- 
ing the color parameters over time to a dynamic model is explained in Section 
3. And next, in Section 4, the complete tracking algorithm and model adaption 
is described in detail. Results and conclusions are presented in Sections 5 and 6, 
respectively. 

2 Color Model 

In order to represent the color distribution of a monochrome object, color his- 
tograms have been demonstrated to be an effective technique (e.g. [5]). However, 
when the object to be modeled contains regions with different color, the number 
of pixels representing each color can be relatively low and a histogram represen- 
tation may not suffice. In this case, a better approach is to use Gaussian Mixture 
models. The conditional probability for a pixel x belonging to a multi-colored 
object O is expressed as a sum of M a Gaussian components: 

M 0 

H x l c) = (!) 

3 = 1 

Similarly, the background color will be represented by a mixture of M & Gaussians. 

2.1 Model Order Selection 

Similar to the problem of selecting the number of bins in histogram models, using 
MoG conceals the challenge of choosing the number of Gaussian components that 
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Fig. 2. Fitting a MoG to color data in rg colorspace, (a) Initialization with K = 6 
components. (b),(c),(d) are three intermediate steps. The best estimate is the one 
corresponding to K = 4 components. 



better adjust the data. In [7], the model order is selected by iteratively applying 
the EM algorithm and splitting those components having lower a posteriori 
probability p(0 |x). We have observed that this method generates too many 
components, some of them unnecessary, increasing the computational cost of 
the segmentation stage. 

We suggest the use of the method proposed in [3] , based on a Minimum Mes- 
sage Length (MML) criteria that is implemented by a modified EM algorithm. 
This algorithm performs much more stable and generates an initial set with a 
lower number of gaussian components than in [7]. In Fig. 2, we show several 
steps of the fitting process. The algorithm begins with a large number of compo- 
nents (introduced by the user), and iteratively performs an annihilation of those 
components that are not supported by the data. 

2.2 Figure-Ground Segmentation 

In order to segment the object of interest from the background we model both 
color distributions by MoGs and compute the a posteriori probability that a 
pixel x belongs to the object using the Bayes rule: 

fGlxl = p(x\Q)P(Q) .. 

Vy 1 j p(x|0)P(0)+p(x|B)P(B) 1 j 

where B refers to the background, and P (O), P (B) represent the prior proba- 
bility of object and background, respectively. These values are approximated to 
the expected area of the object in the search region (Fig. 3). 




Fig. 3. MoGs of O (the can) and B, and probability density maps, (a) Original image, 
(b) Crosses and dashed lines correspond to B pixels and B Gaussian components, 
respectively, and points and continuous lines are O pixels and Gaussians. (c) p(x| O) 
(d) p(x.\B) (e) p(0 |x). Brighter points correspond to more likely pixels. 
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2.3 Model Paramaterization 

Once we have learnt the initial configuration of the MoG for O and B, we pa- 
rameterize them with the following state vector: 

X E =\p E ,n E ,\ E ,9 E ] (3) 

where e = {0,B}, and p £ = [p^ E \ • ■ ■ ,p£ M ^] contains the prior probabilities of 
each component, fi e = [/4*\ • • • , /4 M ^] are the centroids of each gaussian, the 
eigenvalues of the principal directions are represented by A £ = [A^, • • • , Ae M ^] 
and 0 e = [0^, • • • , 6 E Me ] are the angles between the principal axis of each com- 
ponent with the horizontal. Observe the interest of having a low number of 
gaussian components in order to reduce the dimensionality of this state vector. 
The algorithm described in section 2.1 works properly in the sense that allows 
us to select the lowest number of gaussian components that best represent the 
data. 

3 Learning the Dynamical Model 

One of the stages of the tracking algorithm, consists of propagating the state 
vector from Eq. 3, in order to generate multiple hypotheses about the future 
configuration of the MoG. In order to formulate these hypotheses we formalize a 
dynamic motion model in terms of an auto-regressive Markov process. We model 
color dynamics as a 2nd order process, represented by the expression: 

X e .t = AqX e j-2 + AiX Ett -i + Do + Bo\v t (4) 

where the matrices A 0 , A\ represent the deterministic component of the model, 
Dq is a fixed offset, and -Bo w t is the stochastic component, with w t a vector of 
standard normal random variables with unit standard deviation and BqB^ is 
the process noise covariance. The parameters Aq, Ai, Bq and Dq are learned a 
priori using the MLE algorithm described in [1] . In order to generate the training 
data we use a hand-segmented sequence, where the initial MoG configuration is 
fitted using the unsupervised algorithm described in Section 2.1 and to fit the 
subsequent components we use the EM algorithm. In Fig. 4 we show the evolution 
of the training parameters for the O and B color distributions. 
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Fig. 4. Evolution of the Foreground parameters for the color distribution of a hand- 
segmented training sequence. There are shown the results for the 4 components of the 
MoG. 
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p(x,_, | Z ,_, ) 








Fig. 5. One iteration of the implemented tracking algorithm for the one-dimensional 
case. The weight of each sample is represented by its gray level. The classical imple- 
mentation of the CONDENSATION algorithm uses the steps (a)-(e). In our algorithm, 
we have added the ‘concentration’ step, where the samples are redirected to the local 
maxima. 



4 The Tracking Algorithm 

The basic steps of the tracking algorithm follow the typical procedure of the 
particle filters, but we introduce a modification similar to the idea presented 
in the algorithm ICONDENSATION [4], and in order to ‘direct’ the search for 
the next iteration we concentrate the future hypotheses on those areas of the 
state-space containing more information about the posterior probability p (x|C>) 
(Fig. 5). Moreover, in this final stage we fuse object color and shape information. 
Next, we will briefly describe each one of the steps of the tracking algorithm: 

1. Probability Density Function of Color Distribution. At time t , there 
are available from the previous iteration a set of N samples (n = 1, . . . , 
N) with the same structure than X (eq. 3), parameterizing N color distri- 
butions (Fig. 5a). Each sample has an associated weight 7 t^\. The whole set 
represents an approximation of the a posteriori density function p (X t -i\Z t -i) 
where Z t _ 1 = {zo, ■ ■ . , Zt-i} is the history of the measurements. The goal 
of the algorithm consists of constructing a new sample set {<s/ ' ri ' > , 7T("' 1 } for 
time t. 

2. Sampling from p ( X t -i \Z t -i). The next step in the estimation of p (X t \Z t ) 

consists of sampling with replacement N times the set {5,"^ }, where each 
element has probability n yTy of being chosen (step (b) from Fig. 5). This, 
will give us a set of MoGs parameterizations. Those samples having 

higher weights may be chosen several times, so the new set can have identi- 
cal copies of elements. On the other hand, those distributions having lower 
weights may not be chosen. 

3. Probabilistic Propagation of the Samples. Each sample 5'j"' 1 is prop- 
agated according to the dynamic model learnt during the training stage 
(eq. 4): 

S t M = A 0 S'[% + AS'S + D 0 + B 0 w t 



(5) 
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Fig. 6. Left: Ten samples of MoG from the set The gray level is proportional to 

the weight of the samples. Right: Posteriori probability maps p(0 jx), computed with 
different MoGs. 



~( n ) 

4. Measure and Weight. In this step, each element has to be weighted 
according to some measured features. From the propagated samples Sf we 
construct the corresponding MoG, that are used to compute the probability 
maps p (0|x), for each sample (Fig. 6). The goal is to assign higher weights 

~( n ) 

to the samples <S t generating ‘better’ segmentations of the tracked object. 

~( n ) 

This is done assigning to each sample S t the following weight: 

(n) _ ExewP(°l x ) _ T, x( j W P(° l x ) (n 

* N w N w U 

where W is the interest region around the previous object position (where 
we predict that will be the object), and N w , N w are the number of image 
pixels in and out respectively, of this interest region. 

5. Sample Concentration. In the last stage of the algorithm (Fig. 5f) we 
concentrate the samples around the local maxima, so that in the following 
iteration the hypotheses are formulated around these more likely regions of 
the state space. In our case, this is absolutely necessary because our state 
vector X has high dimensionality (proportional to the number of gaussian 
components) , and if we let the samples move freely, uniquely governed by the 
dynamic model, the number of hypotheses needed to find the samples repre- 
senting a correct color configuration, is extremely high. The ‘concentration’ 
is performed with the following steps: 




Fig. 7. Steps to extract the exact position of the object fusing color segmentation and 
accurate adjustment by deformable contours (commented in the text). 
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Fig. 8. Fitted contour and p (Ojx) map on a sequence with gradual change of illuminant 
and object position. 



(a) The maximum from the set of weights {7 n = 1, . . . , N is taken, and 
using morphologic operations over its probability map image (Fig. 7b), 
a coarse approximation of the object shape is obtained. 

(b) With this rough shape, we eliminate noisy edges from the original image 
(Fig. 7c, d). 

(c) The contour of the object in the previous iteration, is used as initializa- 
tion of a snake, that is adjusted to the previous edge image (Fig. 7e). The 
fusion of color segmentation and shape information increases the robust- 
ness of the system, because even when the color hypotheses give a highly 
rough estimation, they can be corrected using the contour information. 

(d) Once the object has been accurately detected (Fig. 7f), its color distri- 
bution is extracted. A MoG is fitted to this distribution (using the EM 
algorithm), giving a state vector 6> t *. Samples Sj n> are ‘concentrated’ on 
this new distribution as follows: 

S { t n) = (1 - a)S t (n) + aS t * (7) 

where the parameter a governs the level of concentration. In our exper- 
iments we have set a = 0.8. 



5 Results 

In this section, three different experimental results are presented in order to 
illustrate the robustness and different capabilities of our system. In the first ex- 
periment (Fig. 8) we show how the method is able to face a gradual change of 
illumination and object position. The second experiment (Fig. 9) corresponds to 
a sequence with an abrupt change of both position and illumination. Fig. 9a and 
Fig. 9b correspond to the two consecutive frames presented to the algorithm. 
The a posteriori map of the best hypothesis (Fig. 9c) is used to discriminate 
false edges and fit a deformable contour (Fig. 9d,e). In this experiment we have 
constrained the fitting process to affine deformations. Finally, in the third ex- 
periment (Fig. 10) we show the performance of our system in a natural and 
cluttered scene, where we track the movement of an hippopotamus in the water. 
Observe that although the high level of noise and clutter from the scene the 
algorithm is able to perform a good tracking. 
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Fig. 9. Results of an abrupt change of illuminant and object position (commented in 
text). 




Fig. 10. Fitted contour and p (C?|x) map of a natural sequence. 



6 Conclusions 

In this paper we have presented a new approach to the multi-colored object 
tracking under varying illumination environments that dynamically accommo- 
dates the color and shape of the object of interest. The main contribution of this 
work is the fusion of the shape and color information in the probabilistic frame- 
work offered by the particle filter formulation. We also introduce the concept of 
‘concentration’ in the last stage of the CONDENSATION algorithm, what makes 
the system able to cope with a state vector of high dimensionality. 
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Abstract. The mean shift algorithm is an efficient method for tracking object in 
the color image sequence. However, in the infrared object-tracking scenario, 
there is a singular feature space, i.e. the grey space, for representing the infrared 
object. Due to the lack of the information for the object representation, the ob- 
ject tracking based on the mean shift algorithm may be lost in the infrared se- 
quence. To overcome this disadvantage, we propose a new scheme that is to 
construct a cascade grey space. The experimental results performed on two dif- 
ferent infrared image sequences show our new scheme is efficient and robust 
for the infrared object tracking. 



1 Introduction 

Object tracking is a hot topic in computer vision, image sequence analysis, video 
processing and compression, and perceptual user Interface. The essence of the object 
tracking is to determine the position of the object in the images. The mean shift algo- 
rithm is a nonparametric statistical method for seeking the nearest mode of a point 
sample distribution, which originally advocated by Fukunaga [1], and extended and 
brought to the attention of the image processing community by Yizong Cheng [2], 
Recently, this algorithm has been adopted as an efficient technique for a wide variety 
of application, such as image segmentation, and appearance-based object tracking by 
G. R. Bradski [3], D. Comaniciu and Peter Meer [4, 5, 6]. 

Real-Time infrared object tracking is an important and challenging research task in 
many military applications. Infrared object tracking primarily addresses to localize 
and track the infrared thermal object in the infrared image sequences [7, 8]. Unfortu- 
nately, there are some disadvantages for the infrared image, such as extremely low 
signal to noise ratio (SNR), severe background clutter, non-repeatability of the target 
signature, and high ego-motion of the sensor. Commonly, many infrared object track- 
ing algorithms are imposed different constraints, such as no drastic change of the 
object feature [7], and no sensor ego-motion [8]. In this paper, we follow the ‘mean 
shift’ tracking approach, which was proposed by D. Comaniciu et al. [5, 6]. Due to 
the nature of the infrared image, we choose the grey space as the feature space. In 
contrast to the color space, the lack of the information to represent the object will lead 
to the failure to track the infrared object. In order to overcome the disadvantage, we 
propose a new scheme that is to construct a cascade grey space. Then, the infrared 
object can be represented in the cascade grey space. The experimental results per- 
formed on two different infrared image sequences show our new scheme is efficient 
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and robust for the infrared small object tracking, and infrared object tracking in the 
severe clutter background. 



2 Nonparametric Density Estimation and Mean Shift 



There are several nonparametric methods available for probability density estimation 
[9, p.8 1 1 . Kernel estimation is one of the most popular techniques. Let {Xj } i=1 n be an 
arbitrary set of n points in the d-dimensional space R d , the multivariate kernel density 
estimator with kernel K(x) and windows radius (i.e. bandwidth) h, in the point x is 
defined as [9, p.l 10] 



P 0) = 



1 

nh d 






;= l 



( 1 ) 



There are several commonly used kernels, such as Epanechnikov kernel, and Nor- 
mal kernel, etc. The Epanechnikov kernel is an optimum kernel that yields minimum 
mean integrated square error (MISE) 



K e (x) = < 
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0 



c~\d + 2)(l-|x| 2 ) 



if ||x|| < 1 
otherwise 



(2) 



where c d is the volume of the unit d-dimensional sphere, e.g., Cj=2, c,=rt. c 3 =47t/3. 

Let us introduce the profile [2] of a kernel K as a function k: [0, — > R such 

that K (x) = k( X ) . For example, according to (2) the Epanechnikov kernel is 



k E (x) = 



\c-\d + 2)(l-x) 
0 



if x<\ 
otherwise 



(3) 



Employing the profile notation we can write the density estimate (1) as 



Pk( x ) = 
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rtf 
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) 



(4) 



Let us denote 

g( x ) = -k\x) (5) 

assuming that the derivative of k exists for all X £ [0, °°) , except for a finite set of 
points. A kernel G can be defined as 

G(x) = Cg(||x|| 2 ) (6) 

where C is normalization constant. 

Then, the estimate of the density gradient can be defined as the gradient of kernel 
density estimate (4) 
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) can be assumed to be nonzero. Then, the last bracket in (7) 



contains the sample mean shift vector 
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Obviously, the value of p K (x) will reach maximum when M h G (x) equals zero. 
Moreover, the mean shift procedure is defined recursively by computing the mean 
shift vector M b G (x) , and a proof of the convergence can be found in [2]. 



3 Infrared Object Tracking 

In our infrared object- tracking scenario, we follow the ‘mean shift’ tracking approach, 
which was proposed by D. Comaniciu et al. [5, 6]. In the mean shift object-tracking 
algorithm, there are three basic steps. Firstly, in order to characterize the object, a 
specific feature space must be chosen. In [3] the HSV color space was chosen, and in 
[5, 6] the RGB color space was chosen. In infrared object tracking, we choose the 
grey space as the feature space. After choosing the feature space, the object model q 
and the object candidate p are be represented by its probability density function (pdf) 
in the feature space. Secondly, for measuring the similarity between the object model 
and the object candidate, the similarity function must be defined. Alike [5, 6], the 
Bhattacharyya coefficient is chosen. Thirdly, it is to determine the location corres- 
ponding to the object in the current frame. In mean shift tracking algorithm, this step 
uses the gradient information, which is provided by the mean shift vector. 

In our infrared object-tracking algorithm, there is a new idea: We choose the grey 
space as the feature space. In contrast to the color space, the lack of the information to 
represent the object will lead to the failure to track the object. In order to overcome 
the disadvantage, we propose a new scheme that is to construct a cascade grey space. 
The core of this scheme is that the infrared image is convoluted in the x-direction and 
y-direction to generate two grey sub-spaces by a specific one-dimensional filter. The 
cascade grey space is made of these two grey sub-spaces. 



